Mapping of Sequence Reads to the Reference Genomes ◾ 57
$ 10
A $ 9
GA $ 8
GGA $ 7
TGGA $ 6
CTGGA $ 5
GCTGGA $ 4
GGCTGGA $ 3
TGGCTGGA $ 2
TTGGCTGGA $ 1
CTTGGCTGGA $ 0
Notice that each line includes a suffix (key) and a position (value). Then from the key-value
pairs, we can construct a tree made up of nodes and edges. The positions (values) will be
the nodes and suffixes (keys) will be the edges of the tree. The suffix tree is built as shown
in Figure 2.5, starting from the first suffix on the top and it moves down to make branched
nodes and edges to avoid repeating common characters. This way, we will construct a
suffix tree with nodes and edges. An entire reference genome can be divided into suffixes
and stored this way with both suffixes and indexed positions in the unbranched nodes so
finding a pattern or a position of a read in the reference genome will be easy.
Once the reference sequence is indexed using the suffix tree, one of several searching
algorithms can be used to find the location where a read maps. For instance, to find “TGG”
in Figure 2.5, we will start searching from the root looking for “T”, and from the next node,
we will look for “GG”; thus, since there are two leaf nodes with the indexes 2 and 6, that
means “TGG” is aligned to the reference sequence in the positions 2 and 6 as shown by the
red color (Figure 2.5).
2.2.3 Suffix Arrays
The suffix array (SA) is similar to the suffix tree for pattern matching and finding sub-
strings (reads) and it can be constructed from the suffix tree. It is basically a sorted array
FIGURE 2.5 Nodes and edges of the suffix tree.